Introduction

Background Information

This dataset has information pulled from the Global Health Observatory (GHO) data repository under the World Health Organization (WHO), with additional country specific information collected by the United Nations (UN). There are 22 variables and 2938 observations in this dataset, with 20 of those variables being predicting variables.

Dataset Citation

Description of Dataset

This dataset describes life expectancy of 193 different countries, along with various other factors that may impact life expectancy. Some of these additional factors include, but are not limited to: mortality rate of adults and infants, alcohol consumption, immunizations of specific diseases, deaths by specific diseases, Gross Domestic Product, and population of the country. All of the data relates back to a specific country.

Dataset Interest

Anyone who has stayed up to date with the news will have likely heard about the recent “anti-vaxers” movement. Essentially, people are starting to believe that vaccinations actually have a negative impact on quality of life and life expectancy, and can also lead to autism. While many of these disadvantages of not being immunized, especially the connection to autism, have been disproven, many still do not get vaccinated. Certainly, everyone is entitled to their own choices and beliefs, so we are hoping that an analysis of this dataset may shed some scientific light on both sides of the arguments, right or wrong. While there have been many datasets about life expectancies, there are not many that accompany immunization statistics with it.

Furthermore, we hope, through studying and modeling the relationship between life expectancy and various factors, to identify some of the biggest factors that influence life expectancy. Through this, we could demonstrate the biggest leverage that an under-developed country could take action on fairly easily to improve the life expectancy of their people.

Methods

Import libraries to use

library(lmtest)

Load the data into R

life_expectancy <- read.csv('Life Expectancy Data.csv')

Analyze the plots between life expectancy and the rest of the predictors to do a manual selection of potentially useful predictors

pairs(life_expectancy[, c(4, 1, 2, 3, 5)], col = 'dodgerblue')

pairs(life_expectancy[, c(4, 5, 6, 7, 8)], col = 'dodgerblue')

pairs(life_expectancy[, c(4, 9, 10, 11, 12)], col = 'dodgerblue')

pairs(life_expectancy[, c(4, 13, 14, 15, 16)], col = 'dodgerblue')

pairs(life_expectancy[, c(4, 17, 18, 19)], col = 'dodgerblue')

pairs(life_expectancy[, c(4, 20, 21, 22)], col = 'dodgerblue')

Clean up source data by excluding records with NA value in the columns of interest. Also exclude records with unreasonable records. For example Income.composition.of.resources = 0 does not seem reasonable comparing to the overall distribution.

## remove rows with NA in Life.expectancy
subdata=subset(life_expectancy,!is.na(life_expectancy$Life.expectancy))
## remove rows with NA in Adult.Mortality
subdata=subset(subdata,!is.na(subdata$Adult.Mortality))
## remove rows with NA in Alcohol
subdata=subset(subdata,!is.na(subdata$Alcohol))
## remove rows with NA in BMI
subdata=subset(subdata,!is.na(subdata$BMI))
## remove rows with NA in Polio
subdata=subset(subdata,!is.na(subdata$Polio))
## remove rows with NA in Diphtheria
subdata=subset(subdata,!is.na(subdata$Diphtheria))
## remove rows with NA in GDP
subdata=subset(subdata,!is.na(subdata$GDP))
## remove rows with NA in Income.composition.of.resources
subdata=subset(subdata,!is.na(subdata$Income.composition.of.resources))
## remove rows with NA in Schooling
subdata=subset(subdata,!is.na(subdata$Schooling))
## remove rows with Income.composition.of.resources=0
subdata=subset(subdata,subdata$Income.composition.of.resources!=0)
## remove rows with BMI less than 8
subdata=subset(subdata,subdata$BMI>8)

## create a detaset that contains columns that believe have predicting power of life expectancy
life_exp_good_data=subdata[,c(2, 3, 4, 5, 7, 8, 11, 13, 15, 16, 17, 21, 22)]

Fit an additive linear regression model for all available predictors

le_add_model <- lm(Life.expectancy ~ ., data = life_exp_good_data)

Fit an interaction linear regression model up to all 2 way interactions for all available predictors

le_inter_model <- lm(Life.expectancy ~ . * ., data = life_exp_good_data)

Fit an interaction linear regression model up to all 3 way interactions for all available predictors

le_inter_3_model <- lm(Life.expectancy ~ . * . * ., data = life_exp_good_data)

Model Diagnostics

Identify high leverage data points

add_lev <- hatvalues(le_add_model) > 2 * mean(hatvalues(le_add_model))
inter_lev <- hatvalues(le_inter_model) > 2 * mean(hatvalues(le_inter_model))
inter_3_lev <- hatvalues(le_inter_3_model) > 2 * mean(hatvalues(le_inter_3_model))

Identify influential data points

add_inf <- cooks.distance(le_add_model) > 4 / length(cooks.distance(le_add_model))
inter_inf <- cooks.distance(le_inter_model) > 4 / length(cooks.distance(le_inter_model))
inter_3_inf <- cooks.distance(le_inter_3_model) > 4 / length(cooks.distance(le_inter_3_model))

Check normality assumption for the additive model

qqnorm(resid(le_add_model))
qqline(resid(le_add_model))

shapiro.test(resid(le_add_model))
## 
##  Shapiro-Wilk normality test
## 
## data:  resid(le_add_model)
## W = 0.9827, p-value = 7.937e-15

Check normality assumption for the 2 way interaction model

qqnorm(resid(le_inter_model))
qqline(resid(le_inter_model))

shapiro.test(resid(le_inter_model))
## 
##  Shapiro-Wilk normality test
## 
## data:  resid(le_inter_model)
## W = 0.97639, p-value < 2.2e-16

Check normality assumption for the 3 way interaction model

qqnorm(resid(le_inter_3_model))
qqline(resid(le_inter_3_model))

shapiro.test(resid(le_inter_3_model))
## 
##  Shapiro-Wilk normality test
## 
## data:  resid(le_inter_3_model)
## W = 0.97225, p-value < 2.2e-16

Check equal variance assumption for the additive model

plot(fitted(le_add_model), resid(le_add_model))

bptest(le_add_model)
## 
##  studentized Breusch-Pagan test
## 
## data:  le_add_model
## BP = 84.506, df = 12, p-value = 5.663e-13
hist(resid(le_add_model))

Check equal variance assumption for the 2 way interaction model

plot(fitted(le_inter_model), resid(le_inter_model))

bptest(le_inter_model)
## 
##  studentized Breusch-Pagan test
## 
## data:  le_inter_model
## BP = 199.61, df = 77, p-value = 7.884e-13
hist(resid(le_inter_model))

Check equal variance assumption for the 3 way interaction model

plot(fitted(le_inter_3_model), resid(le_inter_3_model))

bptest(le_inter_3_model)
## 
##  studentized Breusch-Pagan test
## 
## data:  le_inter_3_model
## BP = 366.83, df = 287, p-value = 0.0009927
hist(resid(le_inter_3_model))

Results

Discussion

Appendix